Unsupervised Learning on Country Data

Data Science Tutorial

Danylo Voloshyn, Andreas Payer

Introduction

The purpose of this tutorial is to take you through the data science pipeline, from data collection to insight. We will be working with data about countries: their population, health, and economic statistics. We would like to find out which properties are correlated and how countries can be grouped. This can reveal previously unknown similarities between countries and, through correlated properties (features), suggest ways to address the problems they face. For instance, countries that have a low life expectancy may also have low health expenditures, so we might look to that as a solution.

Then, we will group countries by these metrics using Principal Component Analysis (PCA). This may reveal interesting groupings of countries and help us understand how their features came to be. In particular, many countries in geographic proximity to each other have similar features by observation, so their continent or larger landmass (such as Afro-Eurasia or The Americas) may play a role in these properties. We would like to see if this is true.

We will be using the datasets found here - please download the CSV files and put them in the same directory as your code file:
Unsupervised Learning on Country Data
Population by Country - 2020

Data Collection

First, we will read the datasets from the CSV (Comma-Separated Values) files and load them into pandas DataFrames. Data collection can also be done using web scraping, databases, and other methods, but CSV files are common, especially on data science websites like Kaggle and Data.world. CSV files contain text separated by commas and newline characters, so you should be able to read them by opening them in your editor. This makes the data collection process very simple as well.
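To make the file format concrete, here is a minimal sketch that parses a tiny CSV string with pandas. The country names and numbers are made up for illustration and do not come from the datasets used in this tutorial.

```python
import io
import pandas as pd

# A tiny made-up CSV: the first line is the header, each later line is one record.
csv_text = """country,population,life_expec
Freedonia,1000,70.5
Borduria,2500,64.0
Genovia,400,81.2
"""

# read_csv accepts a path or any file-like object, so StringIO stands in for a file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # pandas infers a type for each column
```

Reading the real files works the same way, with the file name passed in place of the StringIO object.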

Pandas is a powerful data manipulation library that provides data structures and functions to work with structured data.

In [ ]:
import pandas as pd

df1 = pd.read_csv('Country-data.csv')
df2 = pd.read_csv('population_by_country_2020.csv')

Data Processing

We want to combine the datasets, and a simple inner join is sufficient because it yields plenty of countries (over 100) to examine. An outer join would introduce NaN values, which would complicate later analysis and modeling, and given the quantity of data we have, we find it reasonable to discard countries with missing data. However, some countries are listed under slightly different names in the two datasets, so we rename them first to prevent misses when performing the inner join.
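One possible way to discover such name mismatches is an outer join with pandas' `indicator=True` option, which labels each row by the table it came from. The sketch below uses two miniature, made-up tables (the numeric values are hypothetical); only the column names and the country spellings follow the datasets.

```python
import pandas as pd

# Made-up miniature versions of the two tables (values are hypothetical).
left = pd.DataFrame({'country': ['Lao', 'Slovak Republic', 'Albania'],
                     'income': [1, 2, 3]})
right = pd.DataFrame({'Country (or dependency)': ['Laos', 'Slovakia', 'Albania'],
                      'Med. Age': [24, 41, 36]})

# An outer join with indicator=True adds a '_merge' column
# saying whether a row matched in both tables or only one.
check = left.merge(right, how='outer', left_on='country',
                   right_on='Country (or dependency)', indicator=True)

# Rows found in only one table are candidates for renaming.
only_left = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
only_right = check.loc[check['_merge'] == 'right_only',
                       'Country (or dependency)'].tolist()
print(sorted(only_left))
print(sorted(only_right))
```

Comparing the two printed lists side by side makes it easy to build a renaming dictionary by hand.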

In [ ]:
missing = {'St. Vincent and the Grenadines': 'St. Vincent & Grenadines', 
           'Macedonia, FYR': 'North Macedonia',
           'Kyrgyz Republic': 'Kyrgyzstan',
           'Congo, Dem. Rep.': 'DR Congo',
           'Slovak Republic': 'Slovakia',
           'Cape Verde': 'Cabo Verde',
           'Lao': 'Laos',
           'Cote d\'Ivoire': 'Côte d\'Ivoire',
           'Micronesia, Fed. Sts.': 'Micronesia', 
           'Congo, Rep.': 'Congo', 
           'Czech Republic': 'Czech Republic (Czechia)'}

for k in missing:
    df1.loc[df1['country'] == k, 'country'] = missing[k]

df = df1.merge(df2, left_on='country', right_on='Country (or dependency)')
df
Out[ ]:
country child_mort exports health imports income inflation life_expec total_fer gdpp ... Population (2020) Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop % World Share
0 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553 ... 39074280 2.33 % 886592 60 652860 -62920.0 4.6 18 25 % 0.50 %
1 Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090 ... 2877239 -0.11 % -3120 105 27400 -14000.0 1.6 36 63 % 0.04 %
2 Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460 ... 43984569 1.85 % 797990 18 2381740 -10000.0 3.1 29 73 % 0.56 %
3 Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530 ... 33032075 3.27 % 1040977 26 1246700 6413.0 5.6 17 67 % 0.42 %
4 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200 ... 98069 0.84 % 811 223 440 0.0 2.0 34 26 % 0.00 %
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
162 Vanuatu 29.2 46.6 5.25 52.7 2950 2.62 63.0 3.50 2970 ... 308337 2.42 % 7263 25 12190 120.0 3.8 21 24 % 0.00 %
163 Venezuela 17.1 28.5 4.91 17.6 16500 45.90 75.4 2.47 13500 ... 28421581 -0.28 % -79889 32 882050 -653249.0 2.3 30 N.A. 0.36 %
164 Vietnam 23.3 72.0 6.84 80.2 4490 12.10 73.1 1.95 1310 ... 97490013 0.91 % 876473 314 310070 -80000.0 2.1 32 38 % 1.25 %
165 Yemen 56.3 30.0 5.18 34.4 4480 23.60 67.5 4.67 1310 ... 29935468 2.28 % 664042 56 527970 -30000.0 3.8 20 38 % 0.38 %
166 Zambia 83.1 37.0 5.89 30.9 3280 14.00 52.0 5.40 1460 ... 18468257 2.93 % 522925 25 743390 -8000.0 4.7 18 45 % 0.24 %

167 rows × 21 columns

We now have inconsistent column names as well as redundant columns, so we rename the columns appropriately and drop the unnecessary ones. The redundant columns are 'Fert. Rate' and 'Country (or dependency)', because they are already present in the other dataset, and 'Net Change', because we already have the percent change in population, which is more appropriate as it accounts for population size. Then, we drop the 3 rows containing missing values for the urban population feature. We then clean up the data by removing percentage signs and converting all values to numbers that can be processed easily. We also divide the net migrant count by the country's population to get a population-adjusted migrant rate (expressed as a percentage), which is more appropriate here and makes relationships with migration easier to see.

In [ ]:
# drop unnecessary columns
df.drop(columns=['Fert. Rate', 'Country (or dependency)',
        'Net Change'], inplace=True)

# change inconsistent column names
df = df.rename(columns={'health': 'health_spend',
                        'inflation': 'gdp_grow',
                        'total_fer': 'fert',
                        'gdpp': 'gdp_percapita',
                        'Population (2020)': 'population',
                        'Yearly Change': 'pop_grow',
                        'Density (P/Km²)': 'pop_density',
                        'Land Area (Km²)': 'land_area',
                        'Migrants (net)': 'migrants',
                        'Med. Age': 'med_age',
                        'Urban Pop %': 'urban_pop',
                        'World Share': 'world_share'})

# remove countries with missing values
df = df.replace('N.A.', pd.NA)
print("Missing values by feature:\n")
print(df.isna().sum())
df.dropna(inplace=True)

# fix the index after dropping rows
df.index = range(df.index.size)

# remove percentage signs
df['pop_grow'] = df['pop_grow'].str[:-2]
df['urban_pop'] = df['urban_pop'].str[:-2]
df['world_share'] = df['world_share'].str[:-2]

# adjust migrant data
df['migrants'] /= (df['population'] / 100)

# make all values (except country name) numeric for processing
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
df
Missing values by feature:

country          0
child_mort       0
exports          0
health_spend     0
imports          0
income           0
gdp_grow         0
life_expec       0
fert             0
gdp_percapita    0
population       0
pop_grow         0
pop_density      0
land_area        0
migrants         0
med_age          0
urban_pop        3
world_share      0
dtype: int64
Out[ ]:
country child_mort exports health_spend imports income gdp_grow life_expec fert gdp_percapita population pop_grow pop_density land_area migrants med_age urban_pop world_share
0 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553 39074280 2.33 60 652860 -0.161027 18 25 0.5
1 Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090 2877239 -0.11 105 27400 -0.486578 36 63 0.04
2 Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460 43984569 1.85 18 2381740 -0.022735 29 73 0.56
3 Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530 33032075 3.27 26 1246700 0.019414 17 67 0.42
4 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200 98069 0.84 223 440 0.000000 34 26 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
159 Uzbekistan 36.3 31.7 5.81 28.5 4240 16.50 68.8 2.34 1380 33551824 1.48 79 425400 -0.026416 28 50 0.43
160 Vanuatu 29.2 46.6 5.25 52.7 2950 2.62 63.0 3.50 2970 308337 2.42 25 12190 0.038918 21 24 0.0
161 Vietnam 23.3 72.0 6.84 80.2 4490 12.10 73.1 1.95 1310 97490013 0.91 314 310070 -0.082060 32 38 1.25
162 Yemen 56.3 30.0 5.18 34.4 4480 23.60 67.5 4.67 1310 29935468 2.28 56 527970 -0.100216 20 38 0.38
163 Zambia 83.1 37.0 5.89 30.9 3280 14.00 52.0 5.40 1460 18468257 2.93 25 743390 -0.043318 18 45 0.24

164 rows × 18 columns
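The cell above removes the percent signs by slicing off the last two characters, which assumes every value ends in a space followed by '%'. A slightly more defensive variant, sketched here on made-up strings, strips the sign explicitly and lets `pd.to_numeric` turn anything unparseable into NaN.

```python
import pandas as pd

# Made-up strings in the same style as the population table.
raw = pd.Series(['2.33 %', '-0.11 %', 'N.A.', '63%'])

# Remove the percent sign and surrounding whitespace, then convert;
# anything that still is not a number becomes NaN.
cleaned = pd.to_numeric(raw.str.replace('%', '', regex=False).str.strip(),
                        errors='coerce')
print(cleaned)
```

This is only an alternative sketch; the slicing approach works equally well when the formatting is known to be uniform.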

Exploratory Data Analysis & Visualization

We will create sorted plots of the features to help with understanding the feature distributions. Matplotlib is a popular plotting library, and we will use its pyplot interface to create the plots easily. Colors are specified as RGB tuples with components between 0 and 1; feel free to change them to your liking.

In [ ]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(3, 6, figsize=(18, 9))

# define units for the y-axis
titles = ['Child Mortality', 'Exports', 'Health Spending', 'Imports', 
          'Income', 'GDP Growth', 'Life Expectancy', 'Fertility Rate', 
          'GDP Per Capita', 'Population', 'Population Growth', 'Population Density', 
          'Land Area', 'Migrants', 'Median Age', 'Urban Population', 'World Share']
units = ['deaths under 5 per 1000', '% of GDP', '% of GDP', '% of GDP', 
         'net per capita', '% annual', 'years', 'births/woman', '', '', 
         '% annual', 'people/km²', 'km²', '% of pop.', 'years', '%', '% of pop.']

for i, col in enumerate(df.drop(columns='country', inplace=False).columns):
  # select the location and plot there
  plt.sca(ax.flatten()[i])
  plt.bar(df.index, df[col].sort_values(ascending=True), width=1, color=(0.122, 0.31, 0.929))
  plt.title(titles[i])
  plt.ylabel(units[i])
  
  # remove bottom labels as they don't help us here
  plt.tick_params(bottom = False, labelbottom=False)

# remove the last plot since we don't use it
plt.sca(ax.flatten()[-1])
plt.axis('off')

plt.tight_layout()
plt.show()
[Figure: sorted bar charts of each feature]

The sorted plots show distinct patterns across the metrics. Child mortality, exports, health spending, imports, income, GDP per capita, population, population density, land area, and world share are all right-skewed: most countries have low values, and a few outliers have much higher ones. For instance, most countries have low child mortality rates, small export percentages, and low health spending relative to their GDP, while only a few show extremely high values. Fertility rate is also right-skewed, indicating that most countries have low fertility rates. GDP growth and population growth are right-skewed as well, with notable concentrations around lower values: most countries experience low to moderate growth, and a few outliers exhibit very high growth. Conversely, life expectancy and urban population are left-skewed, with most countries having high values, such as high life expectancies and large urban population percentages. Median age shows a more balanced distribution, with most countries falling in a middle range. Net migration is centered around zero, indicating low net migration for most countries, with some experiencing significant inflows or outflows.

Overall, the data highlights that most countries tend to cluster at lower values for many metrics, with fewer countries acting as outliers with much higher values, except for metrics like life expectancy and urban population where higher values are more common.

Now, we will plot the frequency distribution of each feature.

In [ ]:
# do the same as above but plot the distributions
fig, ax = plt.subplots(3, 6, figsize=(18, 9))

for i, col in enumerate(df.drop(columns='country', inplace=False).columns):
  plt.sca(ax.flatten()[i])
  plt.hist(df[col], bins=25, color=(0.063, 0.184, 0.8))
  plt.xlabel(col)
  plt.ylabel('frequency')

plt.sca(ax.flatten()[-1])
plt.axis('off')

plt.tight_layout()
plt.show()
[Figure: histograms of each feature]

We can see that some features approximately follow a normal distribution, where the values cluster around an average, called the mean. For instance, population growth and imports have this distribution, which appears often in data. Others are right-skewed or more like an exponential distribution, where the lowest values, often near zero, are the most common. Features such as child mortality and population seem to have this distribution. Some other features, like median age and urban population, show no clear shape and are spread fairly evenly across their range.
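Skew can also be checked numerically rather than by eye. The sketch below uses pandas' `Series.skew()` on synthetic samples (not the tutorial's data) to show how the sign of the statistic relates to the shape of a histogram.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-ins, not the tutorial's data.
right_skewed = pd.Series(rng.exponential(scale=1.0, size=1000))
left_skewed = -right_skewed                   # mirror image of the above
symmetric = pd.Series(rng.normal(size=1000))

for name, s in [('right_skewed', right_skewed),
                ('left_skewed', left_skewed),
                ('symmetric', symmetric)]:
    # skew() is positive for a long right tail and negative for a long left tail.
    print(name, round(s.skew(), 2))
```

Applied to the real DataFrame, the same call on each numeric column would give a quick way to confirm or correct the visual impressions described above.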

We will now create pair plots between the economic features to help with understanding the relationships among these statistics. We will use the seaborn library to create scatter plots of the features against each other. Seaborn is a powerful visualization library that can create complex plots with very little code. This will help us understand the relationships between the features. The main diagonal of the pair plot will show the distribution of each property, as above, providing context for that property. Each scatter plot will have a line of best fit because of the kind='reg' parameter, helping us see each correlation.

In [ ]:
import seaborn as sns

# economic features, matching the discussion that follows this plot
temp = df[['imports', 'exports', 'income', 'gdp_grow', 'gdp_percapita']].apply(pd.to_numeric)
green = (0.02, 0.38, 0)
lightgreen = (0.02, 0.38, 0, 0.151)

sns.pairplot(temp, kind='reg', plot_kws={'color': green, 'line_kws': {
             'color': green}, 'scatter_kws': {'color': lightgreen}}, 
             diag_kws={'color': lightgreen, 'bins': 25})
plt.show()
[Figure: pair plot of the economic features]

These scatter plots reveal various relationships between economic features. There is a moderate positive correlation between imports and exports, indicating that countries with higher exports tend to import more. A similar, albeit weaker, trend is observed between imports and income, and imports and GDP per capita, suggesting that higher income countries and those with higher GDP per capita tend to import more. However, there is no significant relationship between imports and GDP growth, as the data points are widely scattered. Exports also show a moderate positive correlation with income and GDP per capita, with higher income countries and those with higher GDP per capita tending to export more. Similar to imports, there is no significant relationship between exports and GDP growth. Income and GDP per capita display a strong positive correlation, where higher income countries generally have higher GDP per capita. This relationship is more pronounced at lower income levels and becomes less linear at higher income levels. On the other hand, GDP growth shows no clear trend with imports, exports, income, or GDP per capita, indicating no straightforward relationship with these features.

In summary, while imports, exports, and income have moderate to strong correlations with each other and with GDP per capita, GDP growth does not exhibit clear relationships with the other economic features.

We will also create pair plots between the health features to help with understanding the relationships among these statistics.

In [ ]:
# health features, matching the discussion that follows this plot
temp = df[['child_mort', 'health_spend', 'life_expec', 'fert', 'med_age']].apply(pd.to_numeric)
red = (0.8, 0.086, 0.086)
lightred = (1, 0.255, 0.255, 0.42)

sns.pairplot(temp, kind='reg', plot_kws={'color': red, 'line_kws': {
             'color': red}, 'scatter_kws': {'color': lightred}}, 
             diag_kws={'color': lightred, 'bins': 25})
plt.show()
[Figure: pair plot of the health features]

These scatter plots reveal several key relationships between health features. There is a strong negative correlation between child mortality and life expectancy, indicating that higher child mortality rates are associated with lower life expectancies. Higher child mortality rates are also linked to higher fertility rates (a moderate positive correlation) and to lower median ages (a strong negative correlation). Health spending shows a weak positive trend with both life expectancy and median age, suggesting that higher health spending is generally associated with better health outcomes and older populations, though the correlations are not strong. There is no significant relationship between health spending and fertility rates. Life expectancy and fertility rates display a strong negative correlation, where higher life expectancies are associated with lower fertility rates. Additionally, there is a strong positive correlation between life expectancy and median age, indicating that countries with higher life expectancies tend to have older populations. Lastly, the fertility rate and median age exhibit a strong negative correlation, suggesting that higher fertility rates are found in countries with younger populations.

Overall, the data reveal that child mortality, life expectancy, fertility rates, and median age are strongly interrelated, whereas health spending shows weaker associations with these features.

We will also create pair plots between the population features to help with understanding the relationships among these statistics. We will not include population or land area because population density is a function of those.

In [ ]:
temp = df[['world_share', 'pop_grow', 'pop_density', 'urban_pop', 'migrants']].apply(pd.to_numeric)
blue = (0, 0.306, 0.702)
lightblue = (0.02, 0.259, 0.569, 0.51)

sns.pairplot(temp, kind='reg', plot_kws={'color': blue, 'line_kws': {
              'color': blue}, 'scatter_kws': {'color': lightblue}}, 
             diag_kws={'color': lightblue, 'bins': 25})
plt.show()
[Figure: pair plot of the population features]

The scatter plots illustrate the relationships between various population features, revealing mostly weak or no significant correlations. World share shows weak negative trends with both population density and urban population, indicating that countries with a larger share of the world population tend to have lower population densities and urban populations, though these relationships are not strong. There is no significant relationship between world share and population growth or net migration numbers. Population growth does not exhibit clear trends with any other features, suggesting that population growth rates are independent of population density, urban population, and migration numbers. Similarly, population density shows no significant correlations with urban population or net migration numbers, indicating that population density does not strongly correlate with these features. Lastly, the urban population also shows no clear trends with net migration numbers, suggesting that the percentage of the population living in urban areas is independent of migration trends. Overall, the scatter plots highlight the complexity and independence of demographic characteristics across different countries, with most population features not showing strong correlations with each other.

Now, we will create a pair plot with all the features to see the relationships between them all at once. The previous pair plots were separated by category, so we did not see relationships between features in different categories; this plot covers them. Population and land area are again left out, and we also exclude GDP per capita because it is very similar to income and space is already very limited. This plot will take a bit longer to run than the previous pair plots, so please be patient.

Note: You can right click on the plot and save it as an image to view it in full size.

In [ ]:
temp = df.drop(columns=['country', 'gdp_percapita', 'population', 'land_area']).apply(pd.to_numeric)

purple = (0.063, 0.184, 0.8)
lightpurple = (0.063, 0.184, 0.8, 0.151)

sns.pairplot(temp, kind='reg', plot_kws={'color': purple, 'line_kws': {
    'color': purple}, 'scatter_kws': {'color': lightpurple}}, 
             diag_kws={'color': lightpurple, 'bins': 25}
)

plt.show()
[Figure: pair plot of all remaining features]

Insights

From this, we find the following correlations: income with child mortality, life expectancy, and fertility; population growth with child mortality and fertility; median age with health spending, income, and population growth; and urban population with exports, income, life expectancy, and median age. Some of these make sense at first glance and some do not, but all are interesting relationships to explore further. For instance, it doesn't immediately make sense that median age increases as urban population increases, but it could be the case that urbanized countries tend to have healthier populations that live longer and a low fertility rate, so the older population prevails. On the other hand, as fertility increases, population growth does as well. When more children are born, the population increases at a higher rate, so this relationship is more obvious. Also interesting is that population density, world share, GDP growth, and migrants do not seem to have any strong relationships with other features.
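Relationships read off a pair plot can be backed up with numbers. As a hedged sketch, the block below builds a synthetic table whose columns are constructed to be related in assumed ways (the values are not from the country data) and prints the Pearson correlation matrix with `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic columns with assumed relationships, for illustration only.
fert = rng.uniform(1, 7, n)
toy = pd.DataFrame({
    'fert': fert,
    'pop_grow': 0.5 * fert + rng.normal(0, 0.3, n),  # rises with fertility
    'med_age': 45 - 4 * fert + rng.normal(0, 2, n),  # falls with fertility
    'noise': rng.normal(0, 1, n),                    # unrelated column
})

# Pearson correlation between every pair of columns, each value in [-1, 1].
corr = toy.corr()
print(corr.round(2))
```

Values near 1 or -1 indicate strong linear relationships, and values near 0 indicate none; running `corr()` on the numeric columns of the real DataFrame would give the equivalent table for the country data.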

Analysis & Machine Learning

We will begin by performing dimensionality reduction on the data, a form of unsupervised learning. Unsupervised learning is a type of machine learning that learns patterns in unlabeled datasets. Dimensionality reduction is the process of transforming data from a higher-dimensional vector space to a lower-dimensional one. More specifically, in our context, we seek to reduce the dimensionality of our country data to 2 dimensions, allowing us to plot the countries as points on a 2D graph and find clusters and anomalies in the data.

To perform the dimensionality reduction, we use PCA (Principal Component Analysis), a method whereby a linear transformation between the higher-dimensional and lower-dimensional space is learned that maximizes variance explained by each of the PCA components.
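To give a feel for what PCA computes, here is a minimal NumPy sketch on synthetic data. It uses the singular value decomposition of the centered data, which is one standard way to obtain principal components; it is an illustration, not the scikit-learn implementation used later in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 3 features, two of which move together.
t = rng.normal(size=300)
X = np.column_stack([t,
                     2 * t + rng.normal(scale=0.1, size=300),
                     rng.normal(size=300)])

# Center each column, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; squared singular values are
# proportional to the variance captured along each direction.
explained_ratio = S**2 / np.sum(S**2)
projected = Xc @ Vt[:2].T  # coordinates in the first two components

print(explained_ratio.round(3))
```

Because two of the three synthetic features are nearly redundant, two components capture almost all of the variance, which is the same effect we hope for with correlated country statistics.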

Before performing PCA, we perform some feature engineering, data cleaning and augmentation. We create a new GDP feature by multiplying our GDP per capita and population features. This is important because, without this feature, our clusters and representations will not capture the size of the countries' economies. We drop the country, world share, GDP per capita, and population density columns. The country column is not useful because we want to learn a representation of each country's statistics and not a representation of the countries' names. World share is a redundant feature because it is simply a scaled version of the population feature. GDP per capita is a very similar statistic to income, so we didn't want to include both. Lastly, population density is also redundant because it can be calculated by dividing the population feature by the land area feature. We only want to include unique features in our dimensionality reduction.

We also scale our features to have a mean of 0 and a standard deviation of 1 before performing PCA. This is important because PCA requires 0-mean centering, and without standardization, PCA will prioritize the features with higher variance. We want each feature to be given equal importance.
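The standardization step can be written out by hand. The sketch below uses a small made-up array with two features on very different scales and applies the usual formula, subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 6000.0]])

# Standardize: subtract the column mean, divide by the column standard deviation.
# ddof=0 (the NumPy default) is also what scikit-learn's StandardScaler uses.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.var(axis=0))  # raw variances differ by orders of magnitude
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

Without this step, the second column's much larger variance would dominate the principal components simply because of its units.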

Because PCA is a linear dimensionality reduction technique, it can only find linear relationships between our features. However, some of our features may have other types of relationships, such as logarithmic ones. To address this limitation, we applied a logarithmic transformation to some of our features so that PCA can capture logarithmic relationships. To decide which features to transform, we performed PCA with and without the transformations and compared the variance explained. The features that yielded higher variance explained when log-transformed were the ones we selected for transformation.
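The code in this section applies `np.log` to the selected features. The following sketch, on synthetic data with an assumed relationship, shows why taking the logarithm of a heavily skewed feature can make its relationship with another feature more linear, which is what PCA can pick up.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic example: y depends on the logarithm of x, not on x itself.
x = rng.lognormal(mean=8, sigma=1.5, size=500)  # heavily right-skewed
y = 3 * np.log(x) + rng.normal(scale=1.0, size=500)

# Pearson correlation only measures linear association.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), y)[0, 1]
print(round(r_raw, 2), round(r_log, 2))
```

The correlation is noticeably higher after the transform, mirroring the criterion described above of keeping a transformation when it raises the variance explained.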

Next, we create a scatter plot of the newly projected points in 2D space, labeling them with their country names and color-coding them by their continent.

We can see that there are clusters of similar countries forming. On the right side of the plot, we see there is a greater concentration of Western-style democracies. Many of the countries in this area are also located in Europe, indicated by their red color. On the left side, we can see there is a concentration of less-developed countries in yellow and orange, signifying that they are located in Africa and Asia. This simple 2D plot allows us to see which countries are similar to one another with regard to their economic, health, and population statistics by looking at the distance between their points.

We then repeat this process of performing PCA and plotting the new points, except now we only run PCA on a subset of features. The first subset is health, containing the child mortality, health spending, life expectancy, median age, and fertility features. The second is economic, containing GDP, imports as a percent of GDP, exports as a percent of GDP, income, and GDP growth rate. The third and last is the population subset, containing population, population growth rate, urban population, migrants, fertility rate, and median age.

The variance explained by the first two principal components for the all, health, economic, and population feature sets is 63%, 92%, 73%, and 75%, respectively.

In [ ]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import figure
from matplotlib.patches import Patch
import numpy as np

scaler = StandardScaler()

df['gdp'] = df['gdp_percapita'] * df['population']
copy = df.drop(columns=['country', 'world_share', 'gdp_percapita', 'pop_density'], inplace=False)

for x in ['population', 'income', 'fert', 'child_mort', 'land_area', 'med_age', 'gdp']:
  copy[x] = np.log(copy[x].astype(float))

africa = ["Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi", "Cabo Verde", "Cameroon", "Central African Republic", "Chad", "Comoros", "DR Congo", "Congo", "Côte d'Ivoire", "Egypt", "Equatorial Guinea", "Eritrea", "Eswatini (Swaziland)", "Gabon", "Gambia", "Ghana", "Guinea", "Guinea-Bissau", "Kenya", "Lesotho", "Liberia", "Libya", "Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius", "Morocco", "Mozambique", "Namibia", "Niger", "Nigeria", "Rwanda", "Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa", "South Sudan", "Sudan", "Tanzania", "Togo", "Tunisia", "Uganda", "Zambia", "Zimbabwe"]

asia = ["Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh", "Bhutan", "Brunei", "Cambodia", "China", "Cyprus", "Georgia", "India", "Indonesia", "Iran", "Iraq", "Israel", "Japan", "Jordan", "Kazakhstan", "Kyrgyzstan", "Laos", "Lebanon", "Malaysia", "Maldives", "Mongolia", "Myanmar", "Nepal", "Oman", "Pakistan", "Philippines", "Qatar", "Saudi Arabia", "Singapore", "South Korea", "Sri Lanka", "Syria", "Tajikistan", "Thailand", "Timor-Leste", "Turkey", "Turkmenistan", "United Arab Emirates", "Uzbekistan", "Vietnam", "Yemen"]

europe = ["Albania", "Austria", "Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Czech Republic (Czechia)", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Moldova", "Montenegro", "Netherlands", "North Macedonia", "Norway", "Poland", "Portugal", "Romania", "Russia", "Serbia", "Slovakia", "Slovenia", "Spain", "Sweden", "Switzerland", "Ukraine", "United Kingdom"]

north_america = ["Antigua and Barbuda", "Bahamas", "Barbados", "Belize", "Canada", "Costa Rica", "Cuba", "Dominica", "Dominican Republic", "El Salvador", "Grenada", "Guatemala", "Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua", "Panama", "St. Kitts and Nevis", "St. Lucia", "St. Vincent & Grenadines", "Trinidad and Tobago", "United States"]

south_america = ["Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Guyana", "Paraguay", "Peru", "Suriname", "Uruguay", "Venezuela"]

oceania = ["Australia", "Fiji", "Kiribati", "Micronesia", "New Zealand", "Palau", "Papua New Guinea", "Samoa", "Solomon Islands", "Tonga", "Tuvalu", "Vanuatu"]

colors = list(map(lambda x:
                  'blue' if x in oceania else
                  'purple' if x in south_america else
                  'green' if x in north_america else
                  'red' if x in europe else
                  'orange' if x in asia else
                  'yellow' if x in africa else
                  'black', df['country']))

patches = [Patch(color='blue', label='Oceania'),
           Patch(color='purple', label='South America'),
           Patch(color='green', label='North America'),
           Patch(color='red', label='Europe'),
           Patch(color='orange', label='Asia'),
           Patch(color='yellow', label='Africa')]

titles = ['PCA - All features', 'PCA - Health features', 
          'PCA - Economic features', 'PCA - Population features']

pca = PCA(2)

for columns in [copy.columns,
                ['child_mort', 'health_spend', 'life_expec', 'med_age', 'fert'],
                ['gdp', 'imports', 'exports', 'income', 'gdp_grow'],
                ['population', 'pop_grow', 'urban_pop', 'migrants', 'fert', 'med_age']]:
  scaled_data = scaler.fit_transform(copy[columns])
  data_pca = pca.fit_transform(scaled_data)
  print("Variance explained: {}%".format(pca.explained_variance_ratio_.cumsum()[-1] * 100))

  figure(figsize=(20, 10), dpi=80)
  plt.scatter(data_pca[:, 0], data_pca[:, 1], c=colors, label=df['country'])
  for i, x in enumerate(df['country']):
    plt.annotate(x, (data_pca[i][0], data_pca[i][1]), fontsize=7)

  plt.legend(handles=patches)
  plt.title(titles.pop(0))
  
  plt.show()
Variance explained: 63.21882110565977%
[Figure: PCA - All features]
Variance explained: 92.07966079672393%
[Figure: PCA - Health features]
Variance explained: 72.80403870773425%
[Figure: PCA - Economic features]
Variance explained: 75.29802835880719%
[Figure: PCA - Population features]

In each plot, there is noticeable clustering among countries of the same continent, supporting our hypothesis that a country's continent or geographical location is related to its health, economic, and population conditions. Interestingly, in all the plots, the points of the countries in Africa and Europe are the furthest apart and form relatively tight clusters, with the countries from the remaining continents in between them. This suggests that, in a global context, the economic, social, and population conditions of European and African countries are the most extreme, while countries from other continents, such as Asia, are generally in the middle in terms of how their conditions relate to other countries in these categories of health, economics, and population.
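The visual impression of which groups sit furthest apart could be checked by comparing group centroids in the projected plane. The sketch below is one possible approach on synthetic 2D points; the group labels 'A', 'B', and 'C' and the column names `pc1` and `pc2` are hypothetical stand-ins for continents and PCA coordinates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic 2D points standing in for PCA coordinates; labels are hypothetical.
centers = {'A': (-3.0, 0.0), 'B': (0.0, 0.5), 'C': (3.0, 0.0)}
frames = []
for label, (cx, cy) in centers.items():
    pts = rng.normal(loc=(cx, cy), scale=0.5, size=(30, 2))
    frames.append(pd.DataFrame({'pc1': pts[:, 0], 'pc2': pts[:, 1],
                                'group': label}))
points = pd.concat(frames, ignore_index=True)

# Mean position of each group in the 2D plane.
centroids = points.groupby('group')[['pc1', 'pc2']].mean()

# Euclidean distance between every pair of centroids.
diff = centroids.values[:, None, :] - centroids.values[None, :, :]
dist = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=2)),
                    index=centroids.index, columns=centroids.index)
print(dist.round(2))
```

With the real data, the same computation on the PCA output grouped by continent would show whether Africa and Europe are indeed the most distant pair.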

Insights & Conclusion

From our analyses, we can conclude that geographic location plays a key role in many countries' health, economic development, and population structures and demographics. Countries on the same continent share similar characteristics along these 3 core categories. Across all the categories we examined, Europe and Africa are the most widely separated of any two continents, with the other continents primarily residing in between them. Furthermore, there is high correlation among the features we examined: our 2-component PCA explains 63% of the variance across the 15 statistics. This provides strong evidence that the economic, health, and population conditions of a country are highly interrelated.

There are some outlier countries in these PCA findings, which stray from their continents and from the group of countries as a whole: for instance, Luxembourg and Malta for the economic features, and the US for the health features. It would be interesting to explore this further in a future analysis.